Apache Arrow / ARROW-2799

[Python] Add safe option to Table.from_pandas to avoid unsafe casts

Details

    Description

      Ported over from https://github.com/apache/arrow/issues/2217

      ```python
      In [8]: import numpy as np
         ...: import pandas as pd
         ...: import pyarrow as arw

      In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
         ...: df
      Out[9]:
      A B
      0 a 0
      1 b 1
      2 c 2

      In [10]: schema = arw.schema([
      ...: arw.field('A', arw.string()),
      ...: arw.field('B', arw.int32()),
      ...: ])

      In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
      ...: tbl
      Out[11]:
      pyarrow.Table
      A: string
      B: int32
      metadata
      --------
      {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
                  b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
                  b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
                  b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
                  b', "pandas_version": "0.23.1"}'}

      In [12]: tbl.to_pandas().equals(df)
      Out[12]: True
      ```
      ...so if the `schema` matches the pandas datatypes all is well - we can roundtrip the DataFrame.

      Now, say we have some bad data such that column 'B' is now of type float64. The datatypes of the DataFrame no longer match the explicitly supplied `schema` object, but rather than raising a `TypeError`, the data is silently truncated. The roundtripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!
      ```python
      In [13]: df['B'].iloc[0] = 1.23
      ...: df
      Out[13]:
      A B
      0 a 1.23
      1 b 1.00
      2 c 2.00

      In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
      ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
      ...: tbl
      Out[14]:
      pyarrow.Table
      A: string
      B: int32
      metadata
      --------
      {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
                  b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
                  b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
                  b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
                  b'}], "pandas_version": "0.23.1"}'}

      In [15]: tbl.to_pandas() # <-- SILENT TRUNCATION!!!
      Out[15]:
      A B
      0 a 1
      1 b 1
      2 c 2

      ```

      To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema, and I hope the current behaviour will be considered a bug.

            People

              Assignee: kszucs Krisztian Szucs
              Reporter: dhirschfeld Dave Hirschfeld

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m