Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 0.9.0
Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
Description
STEPS TO REPRODUCE:
1. Create a conda environment from environment.yml.
2. Run repro.py, adjusting its config variables as needed, to create a ParquetDataset on S3 from repro.csv.
3. Create a reference to that ParquetDataset with repro_2.py, again adjusting its config variables.
EXPECTED:
Reference is created correctly.
GOT:
A ValueError from the validate_schemas() method, reporting mismatched Arrow schemas:
```text
ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [ {"name": "Record_ID", "field_name": "Record_ID", "pandas_type"' b': "int64", "numpy_type": "int64", "metadata": null},
{"name": "y' b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f' b'loat64", "metadata": null},
{"name": "TRACKID", "field_name": "T' b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta' b'data": null},
{"name": "MMSI", "field_name": "MMSI", "pandas_typ' b'e": "int64", "numpy_type": "int64", "metadata": null},
{"name": ' b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"' b': "int64", "metadata": null},
{"name": "AgeMinutes", "field_name' b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6' b'4", "metadata": null},
{"name": "SoG", "field_name": "SoG", "pan' b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64' b'", "numpy_type": "int64", "metadata": null},
{"name": "Length", ' b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i' b'nt64", "metadata": null},
{"name": "Callsign", "field_name": "Ca' b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta' b'data": null},
{"name": "Destination", "field_name": "Destination' b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":' b' null},
{"name": "ETA", "field_name": "ETA", "pandas_type": "int' b'64", "numpy_type": "int64", "metadata": null},
{"name": "Status"' b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"' b': "object", "metadata": null},
{"name": "ExtraInfo", "field_name' b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"' b', "metadata": null},
{"name": "TIMESTAMP", "field_name": "TIMEST' b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":' b' null},
{"name": null, "field_name": "__index_level_0__", "panda' b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
b'ndas_version": "0.21.0"}'}
vs
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: null
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
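b' [ {"name": "Record_ID", "field_name": "Record_ID", "pandas_type"' b': "int64", "numpy_type": "int64", "metadata": null},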
{"name": "y' b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f' b'loat64", "metadata": null},
{"name": "TRACKID", "field_name": "T' b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta' b'data": null},
{"name": "MMSI", "field_name": "MMSI", "pandas_typ' b'e": "int64", "numpy_type": "int64", "metadata": null},
{"name": ' b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"' b': "int64", "metadata": null},
{"name": "AgeMinutes", "field_name' b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6' b'4", "metadata": null},
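{"name": "SoG", "field_name": "SoG", "pan' b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64' b'", "numpy_type": "int64", "metadata": null},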
{"name": "Length", ' b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i' b'nt64", "metadata": null},
{"name": "Callsign", "field_name": "Ca' b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta' b'data": null},
{"name": "Destination", "field_name": "Destination' b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":' b' null},
{"name": "ETA", "field_name": "ETA", "pandas_type": "int' b'64", "numpy_type": "int64", "metadata": null},
{"name": "Status"' b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"' b': "object", "metadata": null},
{"name": "ExtraInfo", "field_name' b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", ' b'"metadata": null},
{"name": "TIMESTAMP", "field_name": "TIMESTAM' b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n' b'ull},
{"name": null, "field_name": "__index_level_0__", "pandas_' b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
b'as_version": "0.21.0"}'}
```
The two schemas differ only in the ExtraInfo column. In the partitioned ParquetDatasetPiece that references the second Parquet file created, ExtraInfo has Arrow type string and pandas_type unicode. In the ParquetDataset schema, which is taken from the first Parquet file created, the same column has Arrow type null and pandas_type empty.
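One way to see how such a mismatch can arise is to infer each column's type separately for every piece of a dataset. The following is a minimal, library-free sketch of that idea, not pyarrow's implementation. The helper names (`infer_type`, `infer_schema`, `mismatched_pieces`) and the sample rows are hypothetical; only the column names come from the report.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def infer_type(values: List[Optional[object]]) -> str:
    """Infer a type name from the values alone (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"  # an all-null column carries no type information
    kinds = {type(v) for v in non_null}
    if kinds == {int}:
        return "int64"
    if kinds <= {int, float}:
        return "double"
    if kinds == {str}:
        return "string"
    raise TypeError(f"unsupported value types: {kinds}")


def infer_schema(rows: List[Row]) -> Dict[str, str]:
    """Infer a column -> type mapping from one group of rows."""
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


def mismatched_pieces(pieces: Dict[str, List[Row]]) -> List[str]:
    """Return the pieces whose inferred schema differs from the first piece's."""
    names = list(pieces)
    reference = infer_schema(pieces[names[0]])
    return [n for n in names[1:] if infer_schema(pieces[n]) != reference]


# Hypothetical rows: ExtraInfo is entirely null in the first piece only.
pieces = {
    "first": [{"MMSI": 1, "ExtraInfo": None}],
    "second": [{"MMSI": 2, "ExtraInfo": "note"}],
}

# Inferring per piece yields null for one piece and string for the other.
per_piece = {name: infer_schema(rows) for name, rows in pieces.items()}

# Inferring once over all rows yields a single schema every piece could share.
all_rows = [row for rows in pieces.values() for row in rows]
shared = infer_schema(all_rows)
```

In this model, `mismatched_pieces(pieces)` flags the second piece, while `shared` assigns ExtraInfo a single type. Fixing the schema once for the whole dataset, rather than re-inferring it per piece, appears to be the direction that the title of the linked ARROW-2891 points to.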
Attachments
Issue Links
- is duplicated by ARROW-2891 [Python] Preserve schema in write_to_dataset (Resolved)