  1. Apache Arrow
  2. ARROW-2860

[Python][Parquet][C++] Null values in a single partition of a Parquet dataset result in an invalid schema on read

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Labels:

      Description

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd
      
      from datetime import datetime, timedelta
      
      
      def generate_data(event_type, event_id, offset=0):
          """Generate data."""
          now = datetime.utcnow() + timedelta(seconds=offset)
          obj = {
              'event_type': event_type,
              'event_id': event_id,
              'event_date': now.date(),
              'foo': None,
              'bar': u'hello',
          }
          if event_type == 2:
              obj['foo'] = 1
              obj['bar'] = u'world'
          if event_type == 3:
              obj['different'] = u'data'
              obj['bar'] = u'event type 3'
          else:
              obj['different'] = None
          return obj
      
      
      data = [
          generate_data(1, 1, 1),
          generate_data(1, 1, 3600 * 72),
          generate_data(2, 1, 1),
          generate_data(2, 1, 3600 * 72),
          generate_data(3, 1, 1),
          generate_data(3, 1, 3600 * 72),
      ]
      
      df = pd.DataFrame.from_records(data, index='event_id')
      table = pa.Table.from_pandas(df)
      
      pq.write_to_dataset(table, root_path='/tmp/events', partition_cols=['event_type', 'event_date'])
      
      dataset = pq.ParquetDataset('/tmp/events')
      table = dataset.read()
      print(table.num_rows)
      

      Expected output:

      6
      

      Actual:

      python example_failure.py
      Traceback (most recent call last):
        File "example_failure.py", line 43, in <module>
          dataset = pq.ParquetDataset('/tmp/events')
        File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
          self.validate_schemas()
        File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
          dataset_schema))
      ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
      bar: string
      different: string
      foo: double
      event_id: int64
      metadata
      --------
      {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
      
      vs
      
      bar: string
      different: null
      foo: double
      event_id: int64
      metadata
      --------
      {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
      

      It appears that pyarrow infers the schema of each partition individually. Both `event_type=3 / event_date=*` partitions have values for the column `different`, whereas the other partitions do not. As a result, the all-`None` column `different` in the other partitions is typed as `null` (`pandas_type` `empty`) instead of `string` (`pandas_type` `unicode`), and schema validation fails.

              People

              • Assignee: Unassigned
              • Reporter: Sam Oluwalana (samo)
              • Votes: 2
              • Watchers: 8