[ARROW-9369] [Python] Support conversion from python sequence to dictionary type - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.17.1
Fix Version/s: 2.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25452

Description

Converting a Python sequence to an array with an explicitly specified dictionary target type is not implemented yet:

In [1]: pa.array(['a', 'b', 'a'], pa.dictionary(pa.int32(), pa.string()))
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-1-bda8628a4917> in <module>
----> 1 pa.array(['a', 'b', 'a'], pa.dictionary(pa.int32(), pa.string()))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Sequence converter for type dictionary<values=string, indices=int32, ordered=0> not implemented

Original report

Hello, I am trying to do the following (please correct me if I am doing something nonsensical):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fields = [pa.field("object", pa.dictionary(pa.int64(), pa.string()))]
data = {
    "object": {
        "a": "a",
        "b": "b",
        "c": "c",
        "s": "d",
    }
}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df, pa.schema(fields))
pq.write_table(table, "test.parquet")

and I am getting:

Traceback (most recent call last):
  File "pa_test.py", line 17, in <module>
    table = pa.Table.from_pandas(df, pa.schema(fields))
  File "pyarrow/table.pxi", line 1451, in pyarrow.lib.Table.from_pandas
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 566, in convert_column
    raise e
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 560, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('Sequence converter for type dictionary<values=string, indices=int64, ordered=0> not implemented', 'Conversion failed for column object with type object')

A workaround is to write the file directly from pandas with df.to_parquet("test.parquet").

Issue Links

links to

GitHub Pull Request #8008

People

Assignee: Krisztian Szucs

Reporter: Tomas Remes

Votes: 0

Watchers: 5

Dates

Created: 08/Jul/20 08:40

Updated: 11/Jan/23 08:06

Resolved: 08/Sep/22 06:47

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1.5h