Apache Arrow / ARROW-3766

[Python] pa.Table.from_pandas doesn't use schema ordering


    Details

      Description

      PyArrow is sensitive to column order when loading partitioned files.
      With pa.Table.from_pandas(dataframe, schema=my_schema) we can apply a schema to a DataFrame. I noticed that the returned pa.Table uses the column ordering of the pandas DataFrame rather than that of the schema. Furthermore, a column can be present in the schema but absent from the DataFrame, in which case it is also absent from the resulting pa.Table.

      This behaviour requires a lot of fiddling with the pandas DataFrame up front if we want to write compatible partitioned files. I therefore argue that for pa.Table.from_pandas, and any comparable function, the schema should be the principal source for the Table structure, not the columns and their ordering in the pandas DataFrame. If I specify a schema, I expect the resulting Table to actually have that schema.

      Here is a small example. If you remove the reordering of df2, everything works fine:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd
      import os
      import numpy as np
      import shutil
      
      PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
      
      if os.path.exists(PATH_PYARROW_MANUAL):
          shutil.rmtree(PATH_PYARROW_MANUAL)
      os.mkdir(PATH_PYARROW_MANUAL)
      
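      # dtype=object keeps the ragged arrays and the NaN placeholders as separate
      # elements; recent NumPy versions raise on ragged input without it.
      arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
      # dtype=object keeps NaN as a missing value instead of the string 'nan'.
      strings = np.array([np.nan, np.nan, 'a', 'b'], dtype=object)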
      
      df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
      df.index.name = 'DPRD_ID'
      df['arrays'] = pd.Series(arrays)
      df['strings'] = pd.Series(strings)
      
      my_schema = pa.schema([('DPRD_ID', pa.int64()),
                             ('partition_column', pa.int32()),
                             ('arrays', pa.list_(pa.int32())),
                             ('strings', pa.string()),
                             ('new_column', pa.string())])
      
      df1 = df[df.partition_column==0]
      df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']]
      
      
      table1 = pa.Table.from_pandas(df1, schema=my_schema)
      table2 = pa.Table.from_pandas(df2, schema=my_schema)
      
      pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
      pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
      
      pd.read_parquet(PATH_PYARROW_MANUAL)
      


              People

              • Assignee: Krisztian Szucs (kszucs)
              • Reporter: Christian Thiel (cthi)

                  Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 40m