  Apache Arrow / ARROW-3210

[Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: Python
    • Environment: Ubuntu 16.04 LTS, System76 Oryx Pro

    Description

      STEPS TO REPRODUCE:

      1. Create a conda environment from the attached environment.yml.

      2. Run repro.py, replacing its config variables with your own, to create a ParquetDataset on S3 from repro.csv.

      3. Open a reference to that ParquetDataset with repro_2.py, again replacing its config variables.

       

      EXPECTED:

      The ParquetDataset reference is created without error.

      GOT:

      A ValueError for mismatched Arrow schemas, raised from the validate_schemas() method:

       

      ```python
      ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
      Record_ID: int64
      y: double
      TRACKID: string
      MMSI: int64
      IMO: int64
      AgeMinutes: double
      SoG: double
      Width: int64
      Length: int64
      Callsign: string
      Destination: string
      ETA: int64
      Status: string
      ExtraInfo: string
      TIMESTAMP: int64
      __index_level_0__: int64
      metadata
      --------
      {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
      b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
      b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
      b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
      b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
      b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
      b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
      b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
      b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
      b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
      b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
      b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
      b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
      b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
      b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
      b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
      b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
      b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
      b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
      b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
      b'data": null}, {"name": "Destination", "field_name": "Destination'
      b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
      b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
      b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
      b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
      b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
      b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
      b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
      b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
      b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
      b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
      b'ndas_version": "0.21.0"}'}

      vs

      Record_ID: int64
      y: double
      TRACKID: string
      MMSI: int64
      IMO: int64
      AgeMinutes: double
      SoG: double
      Width: int64
      Length: int64
      Callsign: string
      Destination: string
      ETA: int64
      Status: string
      ExtraInfo: null
      TIMESTAMP: int64
      __index_level_0__: int64
      metadata
      --------
      {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
      b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
      b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
      b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
      b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
      b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
      b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
      b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
      b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
      b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
      b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
      b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
      b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
      b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
      b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
      b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
      b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
      b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
      b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
      b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
      b'data": null}, {"name": "Destination", "field_name": "Destination'
      b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
      b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
      b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
      b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
      b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
      b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
      b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
      b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
      b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
      b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
      b'as_version": "0.21.0"}'}
      ```

      The mismatch is in column ExtraInfo. In the partitioned ParquetDatasetPiece that references the second Parquet file written, its pandas_type is unicode (Arrow type string). In the ParquetDataset schema, which references the first Parquet file written, the same column has pandas_type empty (Arrow type null).
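
With 17 columns in each schema, the one differing entry is hard to spot by eye. The following is a minimal sketch, using only the Python standard library, of how the two `pandas` metadata blobs from the error message could be compared column by column. The helper name `pandas_type_diff` is hypothetical and not part of pyarrow, and the two metadata values are abbreviated to two of the reported columns for illustration.

```python
import json


def pandas_type_diff(meta_a: bytes, meta_b: bytes) -> dict:
    """Return {field_name: (type_a, type_b)} for columns whose pandas_type differs.

    Hypothetical helper for illustration; it only parses the JSON stored
    under the b'pandas' metadata key shown in the error message.
    """
    def column_types(raw: bytes) -> dict:
        doc = json.loads(raw.decode("utf-8"))
        return {col["field_name"]: col["pandas_type"] for col in doc["columns"]}

    a, b = column_types(meta_a), column_types(meta_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Abbreviated metadata: only two of the columns from the report are kept.
dataset_schema_meta = (
    b'{"columns": ['
    b'{"name": "Status", "field_name": "Status", "pandas_type": "unicode"}, '
    b'{"name": "ExtraInfo", "field_name": "ExtraInfo", "pandas_type": "empty"}]}'
)
piece_schema_meta = (
    b'{"columns": ['
    b'{"name": "Status", "field_name": "Status", "pandas_type": "unicode"}, '
    b'{"name": "ExtraInfo", "field_name": "ExtraInfo", "pandas_type": "unicode"}]}'
)

print(pandas_type_diff(dataset_schema_meta, piece_schema_meta))
# prints {'ExtraInfo': ('empty', 'unicode')}
```

Run against the full metadata from the error above, a comparison like this should report only ExtraInfo, matching the string versus null difference in the printed schemas.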

      Attachments

        1. environment.yml (3 kB, Ying Wang)
        2. repro.csv (4 kB, Ying Wang)
        3. repro_2.py (0.2 kB, Ying Wang)
        4. repro.py (0.4 kB, Ying Wang)

            People

              Assignee: Unassigned
              Reporter: Ying Wang (yingw787)
              Votes: 1
              Watchers: 4
