Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Description
Unicode column names in pandas DataFrames are not handled correctly for some datasets: reading a Parquet file back into a pandas DataFrame fails with the common Python 2 ASCII encoding error.
The dataset used to get the error is here: https://catalog.data.gov/dataset/college-scorecard
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('college_data.csv')
For verification, the DataFrame's columns are indeed unicode
df.columns
> Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
         u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
         ...
         u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
         u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
         u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
        dtype='object', length=123)
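As a quick check (this helper is not part of the original report, just a hedged illustration), the column names that cannot be encoded as ASCII can be listed directly; under Python 2 these are exactly the names that str() will choke on:

def non_ascii_names(columns):
    # Collect the column names that fail ASCII encoding, i.e. the ones
    # that trigger the UnicodeEncodeError shown below.
    bad = []
    for name in columns:
        try:
            name.encode('ascii')
        except UnicodeEncodeError:
            bad.append(name)
    return bad

print(non_ascii_names(df.columns))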
The DataFrame can be saved into a parquet file
arrow_table = pa.Table.from_pandas(df)
pq.write_table(arrow_table, 'college_data.parquet')
But trying to read the parquet file immediately afterwards results in the following
df = pq.read_table('college_data.parquet').to_pandas()
> ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-29-23906ea1efe3> in <module>()
----> 2 df = pq.read_table('college_data.parquet').to_pandas()

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
   1041         if nthreads is None:
   1042             nthreads = cpu_count()
-> 1043         mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
   1044                                              nthreads)
   1045         return pd.DataFrame(mgr)

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
    539     if columns:
    540         columns_name_dict = {
--> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
    542         }
    543         columns_values = [

/Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
    539     if columns:
    540         columns_name_dict = {
--> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
    542         }
    543         columns_values = [

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
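The failure can be reproduced in isolation on Python 2 without the dataset (the column name below is a made-up illustration; the key point is the BOM character u'\ufeff' from the error message): str() converts a unicode object through the ASCII codec, so any non-ASCII name raises the same exception.

# Python 2.7: str() on a unicode object implicitly encodes with the ASCII
# codec, which fails on the BOM character.
name = u'\ufeffUNITID'   # hypothetical column name containing a BOM
str(name)                # UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff'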
Looking at the stack trace, the problem appears to be this line, which calls str() on the column name; on Python 2, str() encodes unicode with the ASCII codec by default: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
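A minimal sketch of one possible fix (hypothetical, not necessarily the patch that was merged): only fall back to str() for keys that are not already text, so unicode column names pass through unchanged on Python 2.

def _name_to_key(name):
    # Keep text as-is; only non-string keys (e.g. integer column labels)
    # go through str().
    if isinstance(name, str):
        return name
    try:
        if isinstance(name, unicode):  # Python 2 text type
            return name
    except NameError:                  # Python 3 has no separate unicode type
        pass
    return str(name)

# Example metadata entries, mirroring the shape used in pandas_compat.py
columns = [{'name': u'\ufeffUNITID', 'field_name': u'\ufeffUNITID'},
           {'name': u'CITY'}]
columns_name_dict = {
    c.get('field_name', _name_to_key(c['name'])): c['name'] for c in columns
}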
Issue Links
- is duplicated by: ARROW-1981 UnicodeEncodeError for column name in pandas_compat.py (Closed)
- links to