[ARROW-1976] [Python] Handling unicode pandas columns on parquet.read_table - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 0.9.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/15643

Description

Unicode column names in pandas DataFrames are not handled correctly for some datasets when a parquet file is read back into a pandas DataFrame, which raises the familiar Python 2 ASCII encoding error (UnicodeEncodeError).

The dataset used to get the error is here: https://catalog.data.gov/dataset/college-scorecard

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('college_data.csv')

For verification, the DataFrame's column names are indeed unicode:

df.columns
> Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
       u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
       ...
       u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
       u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
       u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
      dtype='object', length=123)

The DataFrame can be saved into a parquet file:

arrow_table = pa.Table.from_pandas(df)
pq.write_table(arrow_table, 'college_data.parquet')

But trying to read the parquet file immediately afterwards results in the following error:

df = pq.read_table('college_data.parquet').to_pandas()

> ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-29-23906ea1efe3> in <module>()
----> 2 df = pq.read_table('college_data.parquet').to_pandas()

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
   1041         if nthreads is None:
   1042             nthreads = cpu_count()
-> 1043         mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
   1044                                              nthreads)
   1045         return pd.DataFrame(mgr)

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
    539     if columns:
    540         columns_name_dict = {
--> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
    542         }
    543         columns_values = [

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
    539     if columns:
    540         columns_name_dict = {
--> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
    542         }
    543         columns_values = [

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

Looking at the stack trace, the failure comes from this line, which calls str on the column name; in Python 2, str on a unicode object attempts an ASCII encoding by default: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
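To make the failure mode concrete, here is a minimal sketch that does not need pyarrow or the dataset. The `columns` list is made-up metadata shaped like what the traceback iterates over, `ascii_str` only imitates what Python 2's `str` does to a unicode object, and `to_text` is a hypothetical helper, not necessarily what the linked pull requests implement. Note that the default argument of `dict.get` is evaluated eagerly, so the conversion runs even for entries that do have a `field_name`.

```python
# -*- coding: utf-8 -*-
# Made-up column metadata (assumption); the first name starts with
# U+FEFF, the character named in the error message above.
columns = [{'name': u'\ufeffUNITID'}, {'name': u'OPEID'}]


def ascii_str(name):
    """Imitate Python 2's str(unicode_obj): an implicit ASCII encode."""
    return name.encode('ascii')


def to_text(name):
    """Hypothetical helper: return a text name without an ASCII round trip."""
    if isinstance(name, bytes):
        return name.decode('utf-8')
    return name


# The ASCII conversion fails on the first non-ASCII character.
try:
    ascii_str(columns[0]['name'])
    failed = False
except UnicodeEncodeError:
    failed = True

# Same dict comprehension as in the traceback, with the conversion swapped
# for the text-preserving helper.
columns_name_dict = {
    c.get('field_name', to_text(c['name'])): c['name'] for c in columns
}
```

With the text-preserving conversion, the mapping is built without raising, and the non-ASCII name is kept unchanged as both key and value.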

Issue Links

is duplicated by

ARROW-1981 UnicodeEncodeError for column name in pandas_compat.py

Closed

links to

GitHub Pull Request #1476

GitHub Pull Request #1553

People

Assignee: Licht Takeuchi

Reporter: Simbarashe Nyatsanga

Votes: 1

Watchers: 7

Dates

Created: 07/Jan/18 22:06

Updated: 11/Jan/23 07:18

Resolved: 06/Feb/18 00:27
