Unicode columns in a pandas DataFrame are not handled correctly for some datasets when a parquet file is read back into a pandas DataFrame, which raises the common Python ASCII encoding error.
The dataset used to reproduce the error is here: https://catalog.data.gov/dataset/college-scorecard
For verification, the DataFrame's columns are indeed unicode.
The DataFrame can be saved into a parquet file.
But trying to read the parquet file immediately afterwards fails with the ASCII encoding error.
Looking at the stack trace, the culprit appears to be this line, which calls str and therefore attempts an ASCII encoding by default: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
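
To illustrate the suspected cause: in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails for any non-ASCII character. The sketch below is not the pyarrow code. It emulates that implicit conversion with an explicit .encode("ascii") so it also runs on Python 3, and the column names and helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical column names; the second one contains a non-ASCII character.
ascii_name = u"UNITID"
non_ascii_name = u"caf\u00e9"


def implicit_str(name):
    """Emulate Python 2's str(unicode_obj): encode with the ASCII codec."""
    return name.encode("ascii")


def safe_bytes(name):
    """Encode explicitly as UTF-8, which can represent any Unicode text."""
    return name.encode("utf-8")


def try_encode(name):
    """Return (ok, value): the encoded bytes, or the error type name on failure."""
    try:
        return True, implicit_str(name)
    except UnicodeEncodeError as exc:
        return False, type(exc).__name__


print(try_encode(ascii_name))      # pure ASCII names convert without trouble
print(try_encode(non_ascii_name))  # non-ASCII names trigger the encoding error
print(safe_bytes(non_ascii_name))  # an explicit UTF-8 encode avoids the failure
```

If this is indeed what happens at the linked line, replacing the bare str call with an explicit UTF-8 conversion (or keeping the name as unicode) would be one possible direction for a fix.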