Apache Arrow / ARROW-10532

[Python] Mangled pandas_metadata when specified schema has different order as DataFrame columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda / conda-forge

    Description

      When calling pyarrow.Table.from_pandas() with an explicit schema, the DataFrame columns and the schema fields have to be in the same order, because the pandas_metadata entries are associated with columns by position rather than by column name. If the two orders differ, metadata gets attached to the wrong fields, which leads to all kinds of errors. With bad_schema in the reproduction below, the float column data_col is recorded with numpy_type 'datetime64[ns]' and timezone metadata, while datetime_utc is recorded with numpy_type 'float64' and no metadata.
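The difference between pairing by position and pairing by name can be illustrated with a toy model in plain Python. This is only a sketch of the failure mode described above, not pyarrow's actual implementation; the variable names (`df_columns`, `df_meta`, `by_position`, `by_name`) are hypothetical.

```python
# Toy model, NOT pyarrow's internals: per-column metadata derived from the
# DataFrame, listed in DataFrame column order.
df_columns = ["datetime_utc", "data_col"]
df_meta = [
    {"numpy_type": "datetime64[ns]", "metadata": {"timezone": "UTC"}},
    {"numpy_type": "float64", "metadata": None},
]

# Field names of a schema that lists the same columns in a different order.
schema_names = ["data_col", "datetime_utc"]

# Pairing by position: the i-th schema field gets the i-th DataFrame column's
# metadata, so the entries end up swapped.
by_position = dict(zip(schema_names, df_meta))

# Pairing by name: look up each schema field's metadata via the column name.
meta_by_column = dict(zip(df_columns, df_meta))
by_name = {name: meta_by_column[name] for name in schema_names}

print(by_position["data_col"]["numpy_type"])  # datetime64[ns] (wrong)
print(by_name["data_col"]["numpy_type"])      # float64 (right)
```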

       

      import pyarrow as pa
      import pandas as pd
      import numpy as np
      
      data_col = np.random.random_sample(2)
      datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)
      
      data_field = pa.field("data_col", pa.float32(), nullable=True)
      datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"), nullable=False)
      
      df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
      
      good_schema = pa.schema([datetime_field, data_field])
      bad_schema = pa.schema([data_field, datetime_field])
      
      pa.Table.from_pandas(df, preserve_index=False, schema=good_schema).schema.pandas_metadata
      #{'index_columns': [],
      # 'column_indexes': [],
      # 'columns': [{'name': 'datetime_utc',
      #   'field_name': 'datetime_utc',
      #   'pandas_type': 'datetimetz',
      #   'numpy_type': 'datetime64[ns]',
      #   'metadata': {'timezone': 'UTC'}},
      #  {'name': 'data_col',
      #   'field_name': 'data_col',
      #   'pandas_type': 'float32',
      #   'numpy_type': 'float64',
      #   'metadata': None}],
      # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
      # 'pandas_version': '1.1.4'}
      
      pa.Table.from_pandas(df, preserve_index=False, schema=bad_schema).schema.pandas_metadata
      #{'index_columns': [],
      # 'column_indexes': [],
      # 'columns': [{'name': 'data_col',
      #   'field_name': 'data_col',
      #   'pandas_type': 'float32',
      #   'numpy_type': 'datetime64[ns]',
      #   'metadata': {'timezone': 'UTC'}},
      #  {'name': 'datetime_utc',
      #   'field_name': 'datetime_utc',
      #   'pandas_type': 'datetimetz',
      #   'numpy_type': 'float64',
      #   'metadata': None}],
      # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
      # 'pandas_version': '1.1.4'}
      

       


            People

              Assignee: Joris Van den Bossche (jorisvandenbossche)
              Reporter: Zane Selvans (zaneselvans)